Free English and Czech telephone speech corpus shared under the CC-BY-SA 3.0 license

Authors

Matej Korvas

Ondrej Plátek

Ondrej Dusek

Lukás Zilka

Filip Jurcícek

Abstract

We present a dataset of telephone conversations in English and Czech, developed to train acoustic models for automatic speech recognition (ASR) in spoken dialogue systems (SDSs). The data comprise 45 hours of speech in English and over 18 hours in Czech. All audio data and a large part of the transcriptions were collected using crowdsourcing; the rest was transcribed by hired transcribers. We release the data together with scripts for data pre-processing and for building acoustic models using the HTK and Kaldi ASR toolkits. We also publish the trained models described in this paper. The data are released under the CC-BY-SA 3.0 license; the scripts are licensed under Apache 2.0. In the paper, we report on the methodology of collecting the data, on the size and properties of the data, and on the scripts and their use. We verify the usability of the datasets by training and evaluating acoustic models using the presented data and scripts.

Similar resources

Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition

We present two recently released open-source taggers: NameTag is free software for named entity recognition (NER) which achieves state-of-the-art performance on Czech; MorphoDiTa (Morphological Dictionary and Tagger) performs morphological analysis (with lemmatization), morphological generation, tagging and tokenization with state-of-the-art results for Czech and a throughput around 10-200K wo...

CzEng: Czech-English Parallel Corpus release version 0.5

We introduce CzEng 0.5, a new Czech-English sentence-aligned parallel corpus consisting of around 20 million tokens in either language. The corpus is available on the Internet and can be used under the terms of a license agreement for non-commercial educational and research purposes. Besides the description of the corpus, also preliminary results concerning statistical machine translation experim...

Model-free control of non-minimum phase systems and switched systems

This brief presents a simple derivation of the standard model-free control for the non-minimum phase systems. The robustness of the proposed method is studied in simulation considering the case of switched systems. This work is distributed under CC license http://creativecommons.org/licenses/by-nc-sa/3.0/ arXiv:1106.1697v1 [math.OC] 9 Jun 2011

POLYCOST: A telephone-speech database for speaker recognition

This article presents an overview of the POLYCOST database dedicated to speaker recognition applications over the telephone network. The main characteristics of this database are: large mixed speech corpus size (> 100 speakers), English spoken by foreigners, mainly digits with some free speech, collected through international telephone lines, and more than eight sessions per speaker.

Inter-Annotator Agreement on Spontaneous Czech Language

The goal of this article is to show that for some tasks in automatic speech recognition (ASR), especially for recognition of spontaneous telephony speech, the reference annotation differs substantially among human annotators and thus sets the upper bound of the ASR accuracy. In this paper, we focus on the evaluation of the inter-annotator agreement (IAA) and ASR accuracy in the context of imper...

Publication date: 2014
